Robust acoustic directional sensing enabled by synergy between resonator-based sensor and deep learning

We demonstrate enhanced acoustic sensing arising from the synergy between a resonator-based acoustic sensor and deep learning. We numerically verify that both vibration amplitude and phase are enhanced and preserved at and off the resonance in our compact acoustic sensor housing three cavities. In addition, we experimentally measure the response of our sensor to single-frequency and siren signals, based on which we train convolutional neural networks (CNNs). We observe that the CNN trained by using both amplitude and phase features achieves the best accuracy on predicting the incident direction of both types of signals. This holds even though the signals are broadband and affected by noise, conditions thought to be difficult for resonators. We attribute the improvement to a complementary effect between the two features enabled by the combination of the resonant effect and deep learning. This observation is further supported by comparison with CNNs trained on features extracted from signals measured with a reference sensor without resonators, whose performance falls far behind. Our results suggest the advantage of this synergetic approach for enhancing the sensing performance of compact acoustic sensors on both narrowband and broadband signals, which paves the way for the development of advanced sensing technology with potential applications in autonomous driving systems that must detect emergency vehicles.

and localization of the sound source have been treated as either classification 17,27 or regression 22,28,32 by using neural networks with input features derived from sound magnitude 23, phase 18,34, or their combination 19,33. For a detailed discussion of the development of neural networks for acoustic sensing, we refer readers to a recently published comprehensive review 35.
While existing work has demonstrated the versatility of neural networks in a variety of acoustic sensing applications, most of it has focused on refining learning algorithms, feature extraction, and network architectures. Little attention has been paid to understanding the physical systems that directly interact with signals and the underlying physics behind the extraction of acoustic features. Recent studies 36,37 have shown improved neural network performance thanks to considerable changes in features in the presence of metamaterials. Resonators are well known for amplifying the acoustic response near the resonance. Whether the enhanced acoustic features can be effectively harnessed by neural networks for more accurate and robust acoustic sensing remains an open question.
To bridge these gaps, in this work we fabricated acoustic sensors with and without (reference) resonators, measured their responses to single-frequency and siren signals, and trained CNNs with different acoustic features (i.e., the amplitude, the phase, and both). By comparing the CNN performance between the resonator-based and reference sensors, we show that the resonance effect improves the training features for neural networks, significantly enhancing their accuracy in estimating the sound source direction, especially when amplitude and phase are used simultaneously in the training. For our subwavelength device, it may be intuitive to expect the phase to be less important because of the limited spatial separation between detectors. We demonstrate that CNNs effectively differentiate and extract phase features that may be overlooked by conventional methods. For our resonator-based sensor, we consistently observe better accuracies from the CNNs trained using both amplitude and phase extracted from single-frequency (narrowband) and siren (broadband) signals, even though the signals contain noise, which demonstrates a beneficial synergy between the resonance effect and deep learning as well as the robustness of this hybrid approach.

Acoustic sensor and experimental measurement
The workflow of acoustic sensing with a CNN is shown in Fig. 1. Our aim is to investigate the use of CNNs in combination with acoustic resonators for spatial localization of stationary sound sources. While recurrent neural networks (RNNs) are better suited for tasks that require modeling temporal dependencies, such as predicting the direction of a moving sound source, we believe that CNNs are more effective for our specific purpose in this work as they provide effective separation of the phase and amplitude information as inputs. For this reason, we have chosen CNNs rather than experimenting with various deep learning algorithms.
Time-domain signals are experimentally obtained from our fabricated acoustic sensor. Here, we provide a brief description of the sensor design and measurement setup; further details can be found in Ref. 16. The 3D-printed resonator-based sensor (Markforged Mark Two, Markforged Inc.) possesses three resonant cavities; its diameter, wall thickness, and height are $d = 52$ mm, $t = 4$ mm, and $h = 25$ mm, respectively, while the slit has a height of $h_s = 10$ mm and a width of $w_s = 2$ mm. In our experiments, three surface microphones (130B40, PCB Piezotronics) are positioned inside the cavities and a sound source is placed at different angles in the xy-plane with respect to the sensor to produce the incoming sound field. Different incident angles (from 0 to 350° in steps of 10°, leading to 36 incident angles) are realized by a motorized rotation stage (PRMTZ8, Thorlabs) on which the sensor is mounted while the sound source remains still. The responses of the three microphones are collected for 3 s at a sampling rate of 96 kHz through a data acquisition system (NI USB-4431, National Instruments). The measurements were performed in a study room (11 ft wide by 10 ft long) and the distance between the loudspeaker (5″ diameter) and the acoustic sensor is fixed at 0.3 m. The sensor is carefully positioned on the axis of the loudspeaker to minimize the deviation of sound amplitude due to off-axis positioning. This environment, which includes undesired reflections from the surroundings, allows us to evaluate the robustness of our hybrid approach. While we do not consider varying the distance between the source and sensor, previous studies have demonstrated estimating the distance from the sound source 38–41. More details about the design of the sensor and the experimental measurement can be found in Section S1 of the Supplementary Materials.

Acoustic sensing via deep learning of single-frequency or siren sources
To extract the amplitude and/or phase feature for deep learning, the time-domain signal (3 s long) is converted into a spectrogram 37,42–44 through a short-time Fourier transform (FFT window size of 10.7 ms, hop of 5.3 ms, with the amplitude in dB scale). The spectrogram is a two-dimensional (2D) time-frequency representation whose real and imaginary components yield the amplitude and phase. It is regarded as a 2D image and is subsequently fed into a CNN, which is known to be good at handling image recognition tasks. For three microphones, the CNN input data consists of 6 channels, namely the amplitude and phase obtained from the real and imaginary components for each of the 3 resonators. When the network is trained by both features, all 6 channels are used, whereas training by either the amplitude or the phase feature utilizes the corresponding 3 channels. In the CNN, as shown in Fig. 1, each 2D convolutional layer is followed by a dropout and a max pooling layer to prevent overfitting. L2 normalization is implemented for the same purpose. The input layer is activated by the hyperbolic tangent (tanh) function, while the subsequent layers are instead activated by the exponential linear unit function. The CNN prediction of the source direction angle is regarded as a classification task with 36 classes 17,45,46 corresponding to the 36 angles used in the experiments (10° interval in the 0–350° range). The CNN predicts one of these angles as the output. More details about the CNN architecture can be found in Fig. S2 and Section S2 of the Supplementary Materials.
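The feature extraction described above can be sketched with NumPy alone. This is a minimal illustration, not the authors' pipeline: the window and hop lengths of 1024 and 512 samples are assumptions chosen because they correspond to about 10.7 ms and 5.3 ms at 96 kHz, the Hann window is an assumption, and the function names `stft` and `amplitude_phase_features` are hypothetical.

```python
import numpy as np

FS = 96_000   # sampling rate stated in the text (Hz)
N_FFT = 1024  # assumed: 1024 / 96 kHz is about 10.7 ms
HOP = 512     # assumed: 512 / 96 kHz is about 5.3 ms


def stft(x, n_fft=N_FFT, hop=HOP):
    """Complex short-time Fourier transform, shape (freq_bins, frames)."""
    window = np.hanning(n_fft)  # assumed window type
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1).T


def amplitude_phase_features(signals, eps=1e-12):
    """Map a (3, n_samples) array to a (freq, time, 6) feature array.

    Channels 0-2 hold the amplitude in dB for the three microphones,
    channels 3-5 hold the phase in radians.
    """
    specs = np.stack([stft(s) for s in signals], axis=-1)  # (freq, time, 3)
    amp_db = 20.0 * np.log10(np.abs(specs) + eps)
    phase = np.angle(specs)
    return np.concatenate([amp_db, phase], axis=-1)


if __name__ == "__main__":
    t = np.arange(3 * FS) / FS
    # Three synthetic 900 Hz tones with small phase offsets between microphones.
    signals = np.stack([np.cos(2 * np.pi * 900.0 * t + p) for p in (0.0, 0.3, 0.6)])
    features = amplitude_phase_features(signals)
    print(features.shape)
```

Training with amplitude only would use `features[..., :3]`, and training with phase only would use `features[..., 3:]`, mirroring the 3-channel and 6-channel inputs described in the text.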
For the siren sound, the CNN predicts both the incident angle and the type of the siren. In this work, we have used pre-recorded ambulance and firetruck sirens available from open-source data 47. Each recorded file contains only one type of siren (so we do not tackle multiple sirens that are simultaneously active), with a variety of background noises such as cars running nearby, horns, bus stops, and pedestrians talking. We have carefully inspected the siren quality (i.e., reasonably clear and not sounding too distant) and have chosen 18 recordings (8 and 10 for ambulances and firetrucks, respectively) for the experiment, in which the measurement of each siren is repeated 6 times for each incident angle. With a total of 36 incident angles, we end up with 3888 samples, which are split at random into training, validation, and testing datasets in proportions of 80%, 10%, and 10%, respectively. The CNN trained on datasets split this way works well and hence, throughout this work, we skip additional k-fold validation. The output layers of the network that predict the angle and the type are activated by the softmax function with 36 and 2 classes, and the respective loss functions adopted in the Adam optimizer are the categorical cross entropy and the binary cross entropy, which are assigned equal weights. Throughout this paper, when quantifying the performance of the trained CNN, we define both the validation and the test accuracy as an exact match between the prediction and the ground truth, though a tolerance range has also been used in the literature 46.
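As an illustration of the bookkeeping in this paragraph, the sketch below shows a random 80/10/10 split of the 3888 samples and the exact-match accuracy, together with a tolerance-based variant of the kind mentioned for the literature. The function names, the random seed, and the rounding used to size the subsets are assumptions made for this example.

```python
import numpy as np

N_ANGLES = 36   # angle classes (0 to 350 degrees)
STEP_DEG = 10   # angular spacing between classes


def split_indices(n_samples, fractions=(0.8, 0.1, 0.1), seed=0):
    """Randomly partition sample indices into train, validation, and test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(fractions[0] * n_samples))
    n_val = int(round(fractions[1] * n_samples))
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]


def angle_accuracy(pred_cls, true_cls, tol_deg=0):
    """Fraction of predictions within tol_deg of the truth.

    tol_deg=0 is the exact-match accuracy used in the text. The error is
    measured on the circle, so class 35 (350 degrees) is one step from class 0.
    """
    pred_cls = np.asarray(pred_cls)
    true_cls = np.asarray(true_cls)
    diff = np.abs(pred_cls - true_cls) % N_ANGLES
    circular_deg = np.minimum(diff, N_ANGLES - diff) * STEP_DEG
    return float(np.mean(circular_deg <= tol_deg))


if __name__ == "__main__":
    n_samples = 18 * 6 * 36  # recordings x repeats x angles = 3888
    train, val, test = split_indices(n_samples)
    print(len(train), len(val), len(test))
    print(angle_accuracy([0, 35, 5, 18], [0, 0, 6, 20]))
    print(angle_accuracy([0, 35, 5, 18], [0, 0, 6, 20], tol_deg=10))
```

The exact-match definition is the stricter of the two, so accuracies reported under it are lower bounds on what a ±10° tolerance would give.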
While it is intuitive to expect our sensor to work better around the resonance, we will show later that the combination with the CNN enables good performance for siren sources, which cover a wide range of off-resonant frequencies. As a complement, we measure the sensor's response to single-frequency signals following the procedure described above, repeating the measurement 4 times for each incident angle, which yields 3744 samples (i.e., 4 × 26 frequencies × 36 angles). In this case, the CNN predicts only the incident angle, rather than both the angle and the type as for the sirens. We also go one step further by including the scenario where two sources are active simultaneously. For simplicity, we synthesize the data by combining the previously measured single-source responses at two incident angles drawn from the total of $36n$ possible cases, where $n$ is the number of frequencies; the number of pairs is given by the combination $C(36n, 2)$. This eliminates the need to experimentally exhaust all possible incident angles over the considered frequency range, which would be prohibitively time-consuming.
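To give a sense of scale for the two-source case, the sketch below counts the pairs $C(36n, 2)$ and combines two single-source recordings. The text only says the responses are "combined"; a sample-wise sum (linear superposition) is an assumption made here, and the function names are hypothetical.

```python
import math

import numpy as np


def n_two_source_pairs(n_freqs, n_angles=36):
    """Number of unordered pairs among the 36*n single-source cases, C(36n, 2)."""
    return math.comb(n_angles * n_freqs, 2)


def mix_two_sources(resp_a, resp_b):
    """Combine two single-source recordings of shape (3, n_samples).

    Assumption: the combination is a plain sample-wise sum, which treats
    the acoustic field and the sensor response as linear.
    """
    resp_a = np.asarray(resp_a, dtype=float)
    resp_b = np.asarray(resp_b, dtype=float)
    if resp_a.shape != resp_b.shape:
        raise ValueError("both recordings must have the same shape")
    return resp_a + resp_b


if __name__ == "__main__":
    # With the 26 measured frequencies there are 936 single-source cases.
    print(n_two_source_pairs(26))
    rng = np.random.default_rng(0)
    a = rng.standard_normal((3, 1000))
    b = rng.standard_normal((3, 1000))
    print(mix_two_sources(a, b).shape)
```

The pair count grows quadratically with the number of single-source cases, which is why measuring every pair directly would be impractical.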

Resonant characteristics of the acoustic sensor
The responses of the acoustic sensor with and without resonators (reference case) are compared by conducting numerical simulations using the commercially available software COMSOL Multiphysics 5.3. The resonator-based sensor is modified to have a lower resonance frequency and to accommodate surface microphones for robust sensing, as indicated by the circles in Fig. 2a [consistent color codes are used in Fig. 2b and c]. The acoustic pressure is calculated inside each cavity when the incident acoustic field comes from different angles ranging from 0 to 350°. The losses are considered by using the narrow region acoustics, which defines a fluid model for viscous and thermal boundary-layer-induced losses occurring in the slits (i.e., the openings of the Helmholtz resonators). For a fair comparison, the acoustic pressure of the sensor without resonant cavities ["no resonator (ref.)" case as depicted in Fig. 2a] is computed at the three locations matching those of the microphones in the resonator-based sensor. In Fig. 2b, we compare the acoustic pressures inside the three cavities, namely $p_1$, $p_2$, and $p_3$, by showing the vibration amplitudes $|x_1|$, $|x_2|$, and $|x_3|$, since $p_j = -p_A \gamma_s S_j x_j / V_j$ ($j = 1, 2, 3$), with $p_A$, $\gamma_s$, $S_j$, and $V_j$ being the atmospheric pressure, the ratio of the specific heats (1.4 for air), the cross-section area of the $j$th slit, and the volume of the $j$th cavity. We have chosen the incident angle $\theta_{inc} = 40°$ as an example. Over the frequency range from 700 to 1200 Hz, three well-defined resonant peaks appear at a frequency around 900 Hz, which corresponds to the designed resonant frequency of the sensor. Notably, the peak magnitudes differ because of the positions of the resonant cavities relative to the incident acoustic field. In particular, in Fig. 2b, the incident sound hits cavity 1 and then reaches cavities 2 and 3, whose amplitudes are slightly reduced because of the acoustic shadowing effect. Interestingly, the resonators do not work as filters that remove acoustic features away from the resonant frequency. Instead, even at 1200 Hz, the amplitudes approach the reference values, which are represented by a dashed horizontal line in Fig. 2b. Experimental results are shown by symbols at an interval of 20 Hz and achieve fair agreement with the simulation data. In Fig. 2c, variations of the acoustic pressures at 900 Hz with respect to the incident angle $\theta_{inc}$ are illustrated with an interval of 10°. We note that for each $\theta_{inc}$ a unique combination of $(|x_1|, |x_2|, |x_3|)$ can be found for the sensor with the resonators, whereas for the reference case no variation appears. The experimental results agree with the trend of the simulations with slight mismatches. We note that, unlike our previous work 16, which performed the measurement in a well-controlled environment (2D waveguide) and used anechoic foams to minimize reflection from the surroundings, the current measurements were conducted in a study-room environment without any countermeasure against undesired reflections. Additionally, only the sensor was exposed to the incoming sound waves in the previous experiments 16; the current setup, however, also includes reflections from the fixtures (e.g., the rotation stage) and possible misalignments between the sound source and the sensor. Although the high signal-to-noise ratio (SNR) of 30 dB [Fig. S3a of the Supplementary Materials] indicates a quiet room, the SNR does not capture unwanted reflections and misalignment, which could be a reason for the discrepancy observed between our previous study and Fig. 2c in this work. Similarly, in Fig. 2d, the phase in each cavity with respect to $\theta_{inc}$ is displayed. Non-intuitively, the phases of the incident sound after passing through the resonators are also enhanced in comparison with the reference case, though not as significantly as the amplitudes. This simulation observation is further confirmed by the experimental data, which show good agreement in both trend and magnitude. Figure 2c and d clearly demonstrate the advantage of resonator-based acoustic sensors in preserving and enhancing useful amplitude and phase features, which will benefit the angle sensing performance, as we will see later.
To further understand the resonant enhancement of both amplitude and phase, we look into an analytical model 48,49 given by Eq. (1), where $x_j$ is the displacement of the mass ($m_j$) of the $j$th resonator, $\Delta_j$ is the loss [kg/s], $k_j$ is the spring constant, $\Gamma_{j0}$ is the leakage [kg/s], $\Gamma_{ij}$ is the coupling between the $i$th and $j$th resonators [kg/s], and $F_j$ is the force acting on the $j$th resonator. Because of the discrete rotation symmetry, Eq. (1) is further simplified with $m_j = m$, $\gamma_0 = \Gamma_{jj}/m_j$, $\delta = \Delta_j/m_j$, $\gamma = \Gamma_{ij}/m_j$, $k = k_j$, and $\omega_0^2 = k_j/m_j$ ($\omega_0$: resonance frequency). The resonators in our acoustic sensor couple with each other via the radiation leakage through the environment. Analytically, it reads $\ldots\, e^{in(\theta_i - \theta_j)}$, which quantifies the leakage rate between the $i$th and $j$th resonators, where $k_w$ and $\omega$ refer to the wavenumber and the angular frequency, $\rho$ is the mass density, $S$ and $R$ denote the slit cross-section area and the cylinder radius, $H_n$ and $H_n'$ are the $n$th-order Hankel function and its derivative, and $\theta_{i(j)}$ is the angle determined by the position of the $i$th ($j$th) resonator. We note that the damping coefficients $\gamma$, $\gamma_0$, and $\delta$ were extracted numerically and have typical values of $\gamma \approx 0.02\omega_0$, $\gamma_0 \approx 0.025\omega_0$ to $0.06\omega_0$, and $\delta \approx 0.1\omega_0$, respectively. Alternatively, they can be found analytically. In Fig. S4 of the Supplementary Materials, we provide a comparison between the simulation results and the analytical results. The analytical results, which are based on the harmonic oscillator model with numerically obtained damping coefficients, show good agreement with the simulations. In the frequency domain, Eq. (1) can be explicitly written for three resonators ($N = 3$) as Eq. (2), where $A = -\omega^2 + \omega_0^2 - i\omega(\gamma_0 + \delta)$, and $f_j$ is the force acting on the $j$th resonator, i.e., $f_j = F_j/m = f_0 e^{i\theta_j}$ with $\theta_j = (2\pi R/\lambda)\cos[2\pi(j-1)/3 - \theta_{inc}]$. From Eq. (2), the displacement ratio between resonators 1 and 2 is expressed as Eq. (3). Equation (3) clearly indicates that the ratio $x_1/x_2$ in the presence of resonators is different from the ratio without resonance (i.e., $x_1/x_2 \approx f_1/f_2$), suggesting the enhancement of the amplitude ($\alpha$) and phase ($\beta$) owing to the coupled resonance. In Fig. 2e and f, we show the maximized ratio of acoustic pressures in two cavities, $\max|x_{2(3)}/x_1|$, and the corresponding phase difference, $\angle(x_{2(3)}/x_1)$, for frequencies ranging from 800 to 1300 Hz. The maximum values are determined considering all possible $\theta_{inc}$ within 0–350°. Near 900 Hz, peaks are seen due to the resonance. Additionally, both $\max|x_{2(3)}/x_1|$ and the phase increase with frequency for the resonator-based sensor and evidently surpass those of the reference cases [dashed curves in Fig. 2d], implying that the contrast in the measured signals between the microphones, in terms of both amplitude and phase, increases at higher frequencies. The symbols capture the trend of the simulation results, and the mismatch may again be attributed to the fabrication uncertainty and the noise stemming from the measurement environment.
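The coupled-oscillator picture can be explored numerically with a small sketch. The system solved below is an assumption reconstructed from the quantities defined in the text, not the paper's Eq. (2): each resonator obeys $A x_j - i\omega\gamma \sum_{i \neq j} x_i = f_j$, using the stated $A$, identical coupling $\gamma$ between all pairs, and an $e^{-i\omega t}$ time convention. The speed of sound (343 m/s), the choice $\gamma_0 = 0.04\omega_0$ within the quoted range, and the function names are also assumptions.

```python
import numpy as np

C_AIR = 343.0   # assumed speed of sound in air (m/s)
R = 0.026       # cylinder radius (m), half of d = 52 mm
F0_HZ = 900.0   # designed resonance frequency from the text (Hz)


def forcing(freq_hz, theta_inc_deg, f0=1.0):
    """f_j = f0 * exp(i * theta_j) for the three resonators.

    theta_j = (2*pi*R/lambda) * cos(2*pi*(j-1)/3 - theta_inc), as in the text.
    """
    wavelength = C_AIR / freq_hz
    j = np.arange(3)  # corresponds to j - 1 = 0, 1, 2
    theta_j = (2 * np.pi * R / wavelength) * np.cos(
        2 * np.pi * j / 3 - np.deg2rad(theta_inc_deg)
    )
    return f0 * np.exp(1j * theta_j)


def displacements(freq_hz, theta_inc_deg, gamma=0.02, gamma0=0.04, delta=0.1):
    """Solve the assumed 3x3 frequency-domain system for x_1, x_2, x_3.

    Damping and coupling rates are given in units of omega_0.
    """
    w0 = 2 * np.pi * F0_HZ
    w = 2 * np.pi * freq_hz
    a_diag = -w**2 + w0**2 - 1j * w * (gamma0 + delta) * w0
    b_off = -1j * w * gamma * w0  # assumed form of the coupling term
    system = np.full((3, 3), b_off, dtype=complex)
    np.fill_diagonal(system, a_diag)
    return np.linalg.solve(system, forcing(freq_hz, theta_inc_deg))


if __name__ == "__main__":
    x = displacements(900.0, 40.0)
    f = forcing(900.0, 40.0)
    ratio = x[0] / x[1]
    print("with resonators:", abs(ratio), np.angle(ratio))
    print("forcing only:   ", abs(f[0] / f[1]), np.angle(f[0] / f[1]))
```

Setting `gamma=0` removes the coupling, in which case every $x_j$ equals $f_j/A$ and the displacement ratio reduces to the forcing ratio $f_1/f_2$; with nonzero coupling the two ratios differ, which is the qualitative point the text draws from Eq. (3).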

CNN performance on single-frequency sources
Since the enhancement effect is most effective near the resonance, we start with training the CNN using the data collected from single-frequency signals over 800–1300 Hz (i.e., training used all single-frequency data in the range, while the prediction was made at an individual frequency within the range). We have separately trained three CNNs based on the amplitude feature, the phase feature, or both. As shown in Fig. 3a, after training the CNN for 400 epochs, the validation accuracy on predicting the incident angle for the network trained with both amplitude and phase features reached 98.9%, which slightly exceeds that of the network trained by the phase feature (97.8%) and far outscores the accuracy achieved by the one trained on the amplitude feature (only 79.7%). Notably, using phase alone appears to be better than using both amplitude and phase when the number of epochs is below 250, except below 100 epochs, where using both is slightly better than using phase alone. This is further confirmed by comparing Fig. S5b and S5c of the Supplementary Materials, where between 800 and 1100 Hz, using phase alone is slightly better than using both. Since the enhancement of the amplitude is strong near the resonance, the validation accuracy is expected to be poorer away from the resonant frequency [see Fig. S5a in Section S5 of the Supplementary Materials]. Interestingly, for the phase feature, lower frequencies still provide enough differentiability to increase the validation accuracy, whereas reduced accuracy is observed well beyond the resonance, which may be somewhat counterintuitive but corroborates the simulation result that the resonators do not work as filters [see Fig. S5b in Section S5 of the Supplementary Materials]. When the two features are used during the training, the validation accuracy is improved both below and beyond the resonant frequency, indicating that these features can complement each other and create a synergy to enhance the sensing capability, which seemingly suggests some interference between the two features due to the resonators [see Fig. S5c in Section S5 of the Supplementary Materials]. It is noteworthy that the networks trained by both features and by the phase feature lead to comparable accuracies beyond ~200 epochs, yet training with two features requires the minimization of both loss functions, which might explain why it takes more epochs than training with either feature to reach similar accuracies. We also remark that even though resonance effects are most prominent near the resonant frequency, they are still present across a broader range of frequencies, such as 700–1200 Hz, as demonstrated in Fig. 2b, e, and f. Consequently, the resonators contribute to maintaining a satisfactory validation accuracy even for frequencies below the resonant frequency, ensuring the effectiveness of the system across a wider frequency spectrum. We provide […]
[…] dataset, reaching 98.6% after 1600 epochs of training. Notably, when trained with only the phase feature, the CNN's performance on the validation set drops to 90.9% after 1600 epochs. When only the amplitude feature is utilized in the training, the validation accuracy reduces significantly to 60%. The comparison clearly indicates that the CNN is most effective in predicting the incident angle of sirens when employing both amplitude and phase features. In Fig. 4b–d, the trained CNNs are further checked on testing data unseen during the training and validation. Consistent with Fig. 4a, when trained by both features (Fig. 4b), the predicted incident angle mostly falls within ±10° of the ground truth, achieving a test accuracy of 93.8%; when trained by either the phase (Fig. 4c) or the amplitude (Fig. 4d) feature, the test accuracies become 85.6% and 56.1%, respectively. The predicted incident angles deviate more from the ground truths and the CNNs produce more predictions with slightly larger errors. We do not observe a notable preference for either siren type in Fig. 4b–d; all three trained networks predict the siren type with 100% accuracy. The performance on the testing set again confirms the benefit of training the CNN with both features for enhanced acoustic sensing of sirens. Compared to Fig. 3a, in which single-frequency signals at 800–1300 Hz are employed, the siren signals in Fig. 4 themselves cover a much broader range of frequencies, not to mention the additional frequency components introduced by the background noise [see Fig. S3b–S3d in Section S4 of the Supplementary Materials for more details on the noise analysis]. By combining the resonator-based acoustic sensor and deep learning, the resonance effect, which is intuitively expected to benefit only narrowband signals, instead enables good sensing performance for broadband signals.
To further demonstrate the benefit of the resonance effect, in Fig. 4e we show the validation accuracies of the CNNs trained by the amplitude, the phase, and both features extracted from siren signals [the same ones used in Fig. 4a–d] measured by the reference sensor without resonators. In sharp contrast to Fig. 4a, the CNNs perform significantly worse on the validation data. Specifically, when trained only by amplitude, the CNN accuracy reaches 66.6%, while a reduced accuracy of 59.7% is obtained when trained by phase. This indicates that the reference sensor cannot provide sufficiently useful features for the networks to achieve accuracy as high as with the resonator-based sensor. The network trained by both features has the worst validation accuracy of only 45.6%, which contradicts the trend observed in Fig. 4a. This suggests that, without resonators in the sensor, the two features negatively affect each other such that the CNN accuracy is further impaired when they are used together. These trained networks are subsequently evaluated on test data. In Fig. 4f–h, we can see that the network trained with both features realizes only 42.2% accuracy, followed by the ones trained with either phase or amplitude, reaching 63.2% and 64.5%, respectively. The reductions are beyond 50% and 20% when comparing Fig. 4b with 4f and Fig. 4c with 4g. Despite an exception observed in Fig. 4h, where the reference sensor marginally outperforms the resonator-based sensor (both accuracies are comparably low), the side-by-side comparison in Fig. 4 clearly shows the benefit of having resonators in the sensor, which pairs more effectively with deep learning for improving the estimation of the sound source direction.
In Section S10 of the Supplementary Materials, we provide the time-domain signals of representative ambulance and firetruck sirens and their corresponding spectrograms. To ensure the data is not biased by the repeated measurement of siren signals, the training, validation, and test datasets are also produced by two alternative means. In one, samples belonging to a specific recording, i.e., 1 to 6, are used for testing, while the remaining samples are utilized for training and validation. In the other, samples associated with a specific incident angle, i.e., 0 to 350° in steps of 10°, are chosen as the test data and the rest as the training and validation data for the CNN. As illustrated in Figs. S11 and S12 in the Supplementary Materials, the test results of the CNNs trained by both means do not show a remarkable difference depending on the way the data was divided (i.e., by the recording or by the incident angle), suggesting that the data does not contain bias and that none of the improvement in the directional sensing is expected to arise from biased data.

Conclusions
We have shown that after passing through a compact resonator-based acoustic sensor, both the amplitude and the phase of the incoming sound are enhanced near the resonant frequency, and this acoustic information is not filtered out by the resonators at off-resonance frequencies. Based on the resonance effect, convolutional neural networks have been trained using the amplitude, the phase, or both features. When single-frequency signals are used, the network trained by both features demonstrates the best accuracy in predicting the angle of incidence of either one or two sources. When siren signals with various background noises are utilized, the network trained by both features consistently shows the highest accuracy in predicting the incident angle and the type of a single siren source, even though the signals are broadband. Through comparison with CNNs trained on data collected with a reference sensor without resonators, which consistently show much lower accuracies, our results indicate the synergy between the resonance effect and deep learning for improving the performance of acoustic sensors, especially when they are designed in compact sizes, a condition under which the phase difference between detectors becomes increasingly difficult to distinguish using conventional approaches. We hope that these results will help to pave the way for the development of compact, high-accuracy acoustic sensors for various applications, including the detection of emergency vehicles in autonomous driving systems.

Figure 1. Workflow of the deep-learning-assisted acoustic angle sensing of sirens.

Figure 2. (a) Schematic of the resonator-based acoustic sensor (top) having three resonant cavities and the acoustic sensor without resonators (bottom, referred to as reference). When measuring the reference sensor, the microphones are inserted into the same positions as those of the resonator sensor for fair comparison. Dashed circles indicate the positions where acoustic pressures are calculated in the simulations. The legends "Exp." and "Sim." correspond to the experimental measurements and the FEM simulations. (b) Spectra of acoustic pressures in the three cavities of the resonator-based sensor when the incident angle is $\theta_{inc} = 40°$. Variations of (c) acoustic pressures and (d) phases in the three cavities at 900 Hz for different incident angles. (e) Maximized ratio between acoustic pressures in two cavities and (f) phase difference over the frequency range from 800 to 1300 Hz.

Figure 4. (a) Validation accuracies of CNNs trained by amplitude, phase, and both features extracted from siren signals measured by the resonator-based sensor. (b)–(d) Confusion matrix plots that quantify the performance of the corresponding CNNs in (a) on test data (accuracy values shown in blue font). (e) Validation accuracies of the CNNs trained by amplitude, phase, and both features extracted from siren signals measured by the reference sensor (no resonators). (f)–(h) Confusion matrix plots and accuracies of the CNNs on test data corresponding to the three scenarios in (e).
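The bias check described in the siren results, where a whole repetition or a whole incident angle is held out for testing, can be sketched as a group-based split. This is an illustrative layout only: the metadata ordering, the interpretation of "recording 1 to 6" as the repetition index, and the function names are assumptions.

```python
import numpy as np

N_RECORDINGS, N_REPEATS, N_ANGLES = 18, 6, 36


def build_metadata():
    """Return per-sample siren index, repetition index, and angle (degrees)."""
    rec, rep, ang = np.meshgrid(
        np.arange(N_RECORDINGS),       # 18 siren recordings
        np.arange(1, N_REPEATS + 1),   # repetitions 1 to 6
        np.arange(0, 360, 10),         # incident angles 0 to 350 degrees
        indexing="ij",
    )
    return rec.ravel(), rep.ravel(), ang.ravel()


def holdout_split(group_values, held_out):
    """Use every sample whose group equals held_out as test data."""
    group_values = np.asarray(group_values)
    test_mask = group_values == held_out
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


if __name__ == "__main__":
    rec, rep, ang = build_metadata()
    # Hold out one repetition, then one incident angle.
    train_idx, test_idx = holdout_split(rep, 3)
    print(len(train_idx), len(test_idx))
    train_idx, test_idx = holdout_split(ang, 40)
    print(len(train_idx), len(test_idx))
```

Unlike a purely random split, a group-based split guarantees that no sample from the held-out repetition or angle leaks into training, which is what makes it a useful check against bias from repeated measurements.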